I. INTRODUCTION

One of the more interesting analytical tasks performed by the Facility
in 1971 was a detailed study of the sizes of documents in the ERIC
data base.

This topic had arisen earlier in the year in connection with the
letting of the contract for the ERIC Document Reproduction Service
(EDRS). It was discovered that there was little definitive information
that could be given to prospective bidders on which they could base
their cost calculations. A few analyses based on samples had been
done, but these had not allowed for complexities such as: (1) variation
over time, i.e., 1967 being different from 1970; (2) variation by
type of availability, i.e., Level I documents being different from
Level III*; (3) variation by type of average, i.e., mean being
different from median or mode. Just about the only useful piece of
data that had come from previous samplings was that everybody agreed
the average (mean) document size was somewhere around 70 pages.

It was decided by Central ERIC, therefore, that a definitive study
should be made of the sizes of documents in the ERIC data base. This
would not be a sample; it would be a statistical and frequency distribution
study encompassing every item on the file whose record contained
pagination data in machine-readable form. It would settle the size
question once and for all; it would provide pagination data that could
be used by present and future EDRS contractors and bidders; and it
would be one of the few such comprehensive analyses ever performed on
a data base of this size, and in that sense would be useful data for
the entire information science community.

II. STATEMENT OF WORK 

With the above considerations in mind, the following Statement of Work 
was added to the ERIC Facility contract:

Prepare Frequency Distribution Analysis Report of Pagination Field.

Write, compile, test, debug and run computer programs to analyze the
pagination field in all ED accessions in the data base to determine
frequency of occurrence of reports of all sizes (number of pages).

Prepare graphic reports in the general format of Attachment 1 on 
segments of the file as follows: 

(a) By Period - Report for each period as follows: 1966-1967
(combined); 1968; 1969; 1970; 1966-1970.

(b) By Type - For each period listed in (a) above, display pagination
frequency for Level I, Level II, Level III, Levels I and II, and
total (Levels I, II, III).

*Level I = MF & HC; Level II = MF Only; Level III = Not Available from EDRS


(c) Calculations - For each of the five (5) [...] each period, calculate
and specify:

(1) Mean - Absolute Average, Total page [...]

(2) Median - 50th percentile point (half [...] half are smaller).

(3) Mode - Location of the highest point [...]

III. METHODOLOGY



A. ED Range Analyzed

The ED range analyzed was that which applies to Research in Education
(RIE). It extends from ED-010000, the first number in the first issue of
RIE (November 1966), to ED-048516, the last number in the June 1971
RIE.

It was decided not to analyze the early non-RIE collections, e.g.,
Disadvantaged (ED-001001 - 002740 = 1,740 accessions) and Research
Reports 1956-1965 (ED-002747 - 003960 = 1,214 accessions), for the
following reasons:

1. The Disadvantaged collection is a specialized collection of
documents and does not cover the entire spectrum of the
literature as does RIE. It may, therefore, be atypical in
certain respects, including possibly pagination, and should
not be included for this reason.

2. The pre-RIE records are less reliable in their content and
format than RIE records, having been converted from some
earlier machine-readable format. They present more problems
to the program (variables and special cases) than the RIE
records and would result in a higher percentage of exception
records (printed out by the program as unprocessable).

3. The focus of the study is RIE. It would provide a more concise
and more easily remembered explanation if one could state that
the study simply covers RIE 1966-1971.

It should be said, however, that the pagination analysis programs
could easily be run against this early non-RIE data (2,954 accessions)
should OE desire at any time in the future to do so. We merely
felt that for purposes of this study there was adequate justification
for excluding them.

B. Accessions Analyzed 

Between ED-010000 and ED-048516 there are a theoretical total
of 38,516 records that could have been considered for statistical
analysis. The actual number of records that were considered by
the programs was 36,314. The 2,202 RIE records that were not
considered consist of the following kinds of records:

1. AIM and ARM Collections - These items consist of collections
of documents under one ED number. These collections sometimes
amount to hundreds of documents with total paginations up
to 24,000 pages. If included they would have artificially and
wrongfully skewed the distribution and the averages to a
significant extent.

2. Non-Print Items - Records, tapes, films, etc. There are
very few such items in the ERIC data base. Where they appear
they generally lack pagination data.

3. Accessions Bearing No Pagination Data

These are relatively rare in all years except 1968, when there
appears to have been a policy of not paginating Level III
documents. No Level III documents in 1968 have pagination
data in their records. This is the largest category of
documents not analyzed, comprising over 90% of the total
amount not considered.

4. Accessions Bearing Garbled Pagination Data

The pagination data in ERIC records appears in the Descriptive
Note field. Other types of data may also appear in this field.
When this occurs the data fields are separated by a semicolon.
In order to zero in on the pagination figure and to distinguish
it from the other data in the field, it was necessary to test
for a distinguishing characteristic. This characteristic was
the "p." which generally appears after the numeric pagination
value. In some few cases no "p." existed; either one or both
of the characters were missing. A record containing this
flaw was printed out as an exception record and was not
analyzed. These were relatively few compared to the accessions
bearing no pagination data.
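
The test described in item 4 can be sketched as follows. This is a minimal illustration in Python, not the program actually run for the study; the function name, the regular expression, and the sample field values are assumptions made for the example. Page counts written with a comma (e.g., "1,423p.") are not handled here.

```python
import re

# Assumed pattern: a run of digits followed by the "p." marker.
PAGE_PATTERN = re.compile(r"^(\d+)\s*p\.$")

def extract_pages(descriptive_note):
    """Return the page count found in a Descriptive Note field, or None.

    Subfields are split on semicolons. A subfield counts as pagination
    only if it is a number followed by "p.". Returning None stands in
    for printing the record out as an exception record.
    """
    for subfield in descriptive_note.split(";"):
        match = PAGE_PATTERN.match(subfield.strip())
        if match:
            return int(match.group(1))
    return None

# Made-up field values for illustration only.
for note in ["72p.", "Sample note; 15p.", "15", "15p"]:
    print(repr(note), "->", extract_pages(note))
```

A record whose field lacks either character of the "p." marker yields None, which corresponds to the exception handling described above.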

5. Accessions Missing From the File

The count of 1968 records on the ERIC Resume Master File
turned out to be 15 less than what it should theoretically
have been, indicating 15 missing records. All other years
checked out exactly. This validation of file content was
an unexpected dividend of the pagination analysis work.

In summary, of the 38,516 records that we theoretically started
with, a total of 2,202 (about 5.7%) were eliminated as "defective"
in the various ways enumerated above, and were not permitted to
affect the statistics presented in the tables and graphs of this
report. The vast majority of the "defective" records were 1968
Level III documents, containing no pagination data to analyze.

The 36,314 records analyzed constitute virtually the complete 
ERIC Data Base through June 1971 and are obviously more than a 
sufficient base from which to draw generalizations. 

C. Data Display 

The data which constitute this report are displayed in two
formats, Tables and Graphs, in order to accommodate the varying
preferences of the many different individuals who will be making
use of the figures. The Tables support the Graphs in the sense
that they present many of the specific numbers represented by
points on the graph lines. The Graphs support the Tables in
the sense that they show, in one easily grasped picture, what
the numbers are doing, how they are moving, what the trends
are, etc. Behind both Tables and Graphs are the detailed
computer printouts representing the raw data from which both
displays were developed. The computer printouts will be
maintained in storage indefinitely in order to obviate the need
ever to run the early part of the file again. It is envisaged
that future runs will probably be made at the end of each year
to cover that year.

1. Tables

The Tables are all arranged in the same basic matrix format.
Down the left-hand side are arranged the various
"Levels", i.e., I, II, III, I and II (the "availables"),
and I and II and III (the composite). Across the top are
arranged the various types of averages, i.e., mode, mean,
and median. Also provided are figures for the total number of
documents at the mode value, the total documents at each
"Level", and the total pages at each "Level". In other
words, virtually everything that anyone could ask about
pagination has been presented.

There are tables for each calendar year, and each page is
labeled with the accession range that occurred in that year.
Because of the small number of documents announced in 1966
(two RIE issues), 1966 and 1967 data were grouped together.

A super-composite table, covering the years 1966-1971 (June),
presents the data for the entire 36,314 records analyzed,
and is, in effect, all the other tables combined into one.
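
The matrix layout described above can be sketched in a few lines of Python. This is an illustration of the layout only, not the original program; the function names, the dictionary keys, and the sample records are assumptions. Where two page values tie for the mode, this sketch reports whichever it encounters first.

```python
from collections import Counter
from statistics import mean, median

def summarize(pages):
    """Build one table row from a list of page counts."""
    mode_value, docs_at_mode = Counter(pages).most_common(1)[0]
    return {
        "mode": mode_value,
        "docs_at_mode": docs_at_mode,
        "mean": round(mean(pages), 1),
        "median": median(pages),
        "total_docs": len(pages),
        "total_pages": sum(pages),
    }

def build_table(records):
    """records: list of (level, pages) pairs, level in {"I", "II", "III"}."""
    groups = {
        "I": {"I"},
        "II": {"II"},
        "III": {"III"},
        "I and II": {"I", "II"},           # the "availables"
        "I and II and III": {"I", "II", "III"},  # the composite
    }
    table = {}
    for label, levels in groups.items():
        pages = [p for level, p in records if level in levels]
        if pages:
            table[label] = summarize(pages)
    return table

# Synthetic records for illustration only; not ERIC data.
sample = [("I", 10), ("I", 10), ("I", 40), ("II", 100), ("III", 200)]
for label, row in build_table(sample).items():
    print(label, row)
```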

2. Graphs

Each Table has its corresponding graph. It was found that
each graph falls off very rapidly toward zero after 220 pages,
but each graph also strings out in very attenuated
fashion up to some very high page values. For example, the
1970 distribution ranges from 1 to 3,261 pages, but 88.95
percent of the documents are below 220 pages. This presented
a problem in display, as the horizontal axis threatened to
completely dominate the vertical axis. Since after 220
pages the graph lines are all essentially flat and "hugging"
the horizontal axis, it was finally decided to stop each
graph at that value and provide the percentile reached at
that point. This treatment would allow reasonable pagination
ranges to be used on the horizontal axis and would prevent
a compression effect to the left that would have made the
graphs much less meaningful and informative. To demonstrate
the effect of this kind of distribution on a display, we then
decided to do one graph that would encompass the entire pagination
range of a given period of time without truncation. We decided
upon the two most recent full years, 1969-1970. It is easy
to see, from an examination of this last graph, why the decision
to stop the regular graphs at 220 pages was a reasonable
compromise.

In addition to its basic curve, each graph comes equipped with
a small inset table summarizing the most significant values
shown on the graph. This table also presents the pagination
range dealt with by the graph and calculates the 75th percentile
and the percentile at 220 pages, where the graph truncates.

These inset tables will in almost all cases make it unnecessary
to refer back to the "Table Only" section.
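
The two inset-table figures can be illustrated with a short Python sketch. The function names and sample data are assumptions, and the report does not state which percentile definition was used; the sketch counts documents at or below the cutoff and uses the nearest-rank definition for the 75th percentile.

```python
import math

def percentile_at(pages, cutoff):
    """Percent of documents whose page count is at or below `cutoff`."""
    within = sum(1 for p in pages if p <= cutoff)
    return 100.0 * within / len(pages)

def page_value_at_percentile(pages, percent):
    """Page count at the given percentile (nearest-rank definition, assumed)."""
    ordered = sorted(pages)
    rank = max(1, math.ceil(len(ordered) * percent / 100))
    return ordered[rank - 1]

# Synthetic page counts for illustration only; not ERIC data.
sample = [5, 12, 18, 30, 45, 60, 90, 150, 400, 3000]
print("percentile reached at 220 pages:", percentile_at(sample, 220))
print("75th percentile page value:", page_value_at_percentile(sample, 75))
```

Truncating the horizontal axis at the cutoff loses only the long tail, and the first figure states how much of the collection the truncated graph still covers.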


Vertical lines have been drawn on each graph representing
the range mode, the mean, the median (50th percentile),
and the 75th percentile. A word is perhaps in order
concerning the range mode. The mode is that page value
that is most frequently represented in the collection. We
call this the "individual mode". Since, however, for
graphing purposes it was necessary to select pagination
ranges to plot points against (rather than having each
individual page value represented on the horizontal axis),
we ended up with a pagination range that was the most
frequently represented in the collection. We called this
the "range mode". In the vast majority of cases the individual
mode fell within the range mode, but not always. For
example, in 1970 the overall individual mode was 10 pages
(where there were 212 documents). However, the range mode
worked out to be 11-20 pages. This is a minor irritant,
however, for it takes very little work with these tables and
graphs to become convinced that the mode is probably the
least useful of the averages calculated. It is of curiosity
value only, for instance, that in all the graphs there was
only one bi-modal distribution. 1971 Level III documents
had two individual modes: 21-30 pages and 121-130 pages,
each range having 65 documents representing it.
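
The distinction between the individual mode and the range mode can be shown with a small Python sketch. The ten-page ranges (1-10, 11-20, and so on) are inferred from the ranges quoted above and are an assumption, as are the function names and the sample data.

```python
from collections import Counter

def individual_mode(pages):
    """Most frequent single page value and its document count."""
    return Counter(pages).most_common(1)[0]

def range_mode(pages, width=10):
    """Most populated pagination range, with ranges 1-10, 11-20, ... (assumed)."""
    bins = Counter((p - 1) // width for p in pages)
    index, count = bins.most_common(1)[0]
    return (index * width + 1, (index + 1) * width), count

# Synthetic page counts for illustration only; not ERIC data.
sample = [10, 10, 10, 11, 12, 14, 17, 19]
print("individual mode:", individual_mode(sample))
print("range mode:", range_mode(sample))
```

In this sample the single most common value is 10 pages, yet the 11-20 range holds more documents in total, which is the same situation reported for 1970.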

D. Special Microfiche Analysis

The median for the combined group of EDRS-available documents
(Level I and II) is surprisingly low (39). This suggests that an
unexpectedly large number of ERIC documents will fit on one fiche
and raises the interesting question of the relationship between
reduction ratio (i.e., in effect the number of pages that will fit
on one fiche) and the percentage of the collection that will fit on
one fiche. We have not explored this subject completely but have
restricted ourselves to calculating exactly what percentile of
the ERIC collection will fit on one fiche, two fiche,
three fiche, etc., at the existing reduction ratio. The special
Table presenting this data is organized on the vertical axis by
"cards/pages", i.e., 1 microfiche bearing documents 1-57 pages in
size, 2 microfiche bearing documents 58-127 pages in size, etc.,
and across the horizontal axis by year and then by number of
documents at each card/pages level (with running cumulative
percentages). The far right columns present the composite values
for 1966-1971.

This table, and its associated graph, undoubtedly has the most to
say to the EDRS contractor. From this data, for example, it can be
seen that 61.82% of the total EDRS-available documents will fit on
one fiche; 83.50% of the total EDRS-available documents will fit
on two fiche or less, and so on. By six fiche the graph is essentially
on a plateau, as only 1.57% of the documents require more than six
fiche.
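
The cards/pages conversion can be sketched in Python as follows. Only the first two ranges (1-57 pages on one fiche, 58-127 pages on two) are stated in this report; the sketch assumes that every fiche after the first holds 70 pages, which is an extrapolation from those two ranges. The function names and sample data are likewise assumptions.

```python
from collections import Counter

FIRST_FICHE_PAGES = 57  # stated above: one fiche bears documents of 1-57 pages
EXTRA_FICHE_PAGES = 70  # inferred from the 58-127 range; assumed for later fiche

def fiche_needed(pages):
    """Number of fiche required for a document of the given page count."""
    if pages <= FIRST_FICHE_PAGES:
        return 1
    extra = pages - FIRST_FICHE_PAGES
    return 1 + -(-extra // EXTRA_FICHE_PAGES)  # ceiling division

def cumulative_percent_by_fiche(page_counts):
    """Map n -> percent of documents that fit on n fiche or fewer."""
    counts = Counter(fiche_needed(p) for p in page_counts)
    total = len(page_counts)
    running = 0
    result = {}
    for n in range(1, max(counts) + 1):
        running += counts.get(n, 0)
        result[n] = round(100.0 * running / total, 2)
    return result

# Synthetic page counts for illustration only; not ERIC data.
sample = [10, 39, 57, 58, 127, 128, 200, 500]
print(cumulative_percent_by_fiche(sample))
```

Run against the full file, a tally of this kind would yield running cumulative percentages of the sort quoted above (61.82% on one fiche, 83.50% on two or fewer).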

As byproducts of the special microfiche analysis, a table has been
prepared listing all documents in the data base over 1,000 pages in
size, and a graph has been prepared subdividing each year's
accessions by "Availability Level", i.e., Level I, II, or III.

IV. OBSERVATIONS AND CONCLUSIONS 

Each statistician will have his own set of observations about the
data revealed in this report. The data is now available for many
analyses and extrapolations. We feel that study of this data
will cause many useful facts to emerge.

Since the task of this study was simply to obtain the data and to
present it in frequency distribution format, we have carefully
restrained our strong desire to generalize. Nevertheless, some basic
observations are in order:

Mean - The earlier ideas about the nature of the mean are pretty
well borne out. For documents that are available from EDRS in
hard copy (Level I), it is around 70 pages (1968 = 70, 1969 = 75,
1970 = 68, 1971 = 68). For documents that are available only
in microfiche (Level II), however, it goes up sharply (1968 = 135,
1969 = 115, 1970 = 110, 1971 = 104). For documents that are not
available at all from EDRS (Level III) the rise is even greater
(1968 = 215, 1969 = 152, 1970 = 17[?], 1971 = 180).

Shape of Curve - The distribution is not a normal or bell-shaped
distribution. It is greatly elongated to the right-hand side.
Some elongation was suspected but the extent of it came as a
surprise. The effect of it is, of course, to pull the median
and the mean away from the mode. The mean is much more affected
by the presence of large "giant" documents than is the median,
and is pulled far to the right as a result.
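
The effect of "giant" documents on the two averages can be illustrated with a few lines of Python. The page counts are synthetic values chosen for the example and are not drawn from the ERIC file.

```python
from statistics import mean, median

# Synthetic page counts for illustration only; not ERIC data.
typical = [12, 15, 18, 22, 30, 36, 45, 60, 80]
with_giants = typical + [1500, 3000]  # add two "giant" documents

print("without giants: mean", round(mean(typical), 1), "median", median(typical))
print("with giants:    mean", round(mean(with_giants), 1), "median", median(with_giants))
```

Adding the two large documents moves the median only from 30 to 36 pages, while the mean rises from about 35 to 438 pages.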

Mode - The smallness of the mode value was surprising to us.
We anticipated a value closer to 50 pages, but the mode is
almost universally in the 11-20 range. As we said before, the
general peak of the curve (or range mode) is interesting but
the individual mode is one of the least interesting and most
"accidental" figures.

Median - The median was also smaller than we had originally anticipated.
For Level I documents it runs around 35 pages (1968 = 36, 1969 = 38,
1970 = 3[?], 1971 = 36). For Level II documents it generally runs just
over 60 pages (1968 = 76, 1969 = 68, 1970 = 55, 1971 = 60). When these
figures are combined (1968 = 38, 1969 = 40, 1970 = 37, 1971 = 39), the
median for the combined group of EDRS-available documents is
surprisingly low, around 39 pages. It is the median, and its associated
percentile figures, that have been converted into the "special
microfiche analysis" and that have the most to say to the EDRS contractor.

In closing, we would like to suggest that the next logical extension
of this investigation is to match its data against order data from EDRS.
Are the document orders equally spread (i.e., normally distributed)
across the documents available, or do they restrict themselves to
certain sizes and therefore price ranges?
DOCUMENTS PER MICROFICHE

TABLE 9

TABLE 10

EDRS-AVAILABLE DOCUMENTS OVER 1,000 PAGES*

LEVEL I

NO.         PAGES     NO.         PAGES     NO.         PAGES
ED010106    3,141     ED024023    1,717     ED026546    1,232
ED010408    1,423     ED024925    1,554     ED026635    2,943
ED013197    1,137     ED024936    1,011     ED027369    1,201
ED015677    1,596     ED024937    1,326     ED032546    1,567
ED018676    1,287     ED024942    1,417     ED036281    1,352
ED021205    2,140     ED024943    1,451     ED040303    1,041
ED022178    1,114     ED024955    1,265     ED041344    3,261
ED022179    2,707     ED024956    1,545     ED041685    1,143
ED023083    2,053     ED024957    1,428     ED042728    1,044
ED023094    1,750     ED025742    2,645     ED044757    1,371
ED023096    1,071     ED025774    1,477     ED047798    1,262

LEVEL II

NO.         PAGES     NO.         PAGES     NO.         PAGES
ED019980    3,069     ED031697    1,643     ED034203    1,720
ED030886    2,412     ED032544    1,268     ED034471    1,912
ED031461    1,451     ED033357    3,178     ED038315    1,322
ED031688    3,245     ED034196    3,008

*Through June